Computation and Language 57
☆ MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Contrastive pretraining of image-text foundation models, such as CLIP,
demonstrated excellent zero-shot performance and improved robustness on a wide
range of downstream tasks. However, these models utilize large
transformer-based encoders with significant memory and latency overhead, which
poses challenges for deployment on mobile devices. In this work, we introduce
MobileCLIP -- a new family of efficient image-text models optimized for runtime
performance along with a novel and efficient training approach, namely
multi-modal reinforced training. The proposed training approach leverages
knowledge transfer from an image captioning model and an ensemble of strong
CLIP encoders to improve the accuracy of efficient models. Our approach avoids
train-time compute overhead by storing the additional knowledge in a reinforced
dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for
zero-shot classification and retrieval tasks on several datasets. Our
MobileCLIP-S2 variant is 2.3$\times$ faster and more accurate than the
previous best CLIP model based on ViT-B/16. We further demonstrate the
effectiveness of our multi-modal reinforced training by training a CLIP model
based on ViT-B/16 image backbone and achieving +2.9% average performance
improvement on 38 evaluation benchmarks compared to the previous best.
Moreover, we show that the proposed approach achieves 10$\times$-1000$\times$
improved learning efficiency when compared with non-reinforced CLIP training.
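The dataset-reinforcement idea above, paying the teacher's inference cost once offline rather than on every training step, can be illustrated with a toy sketch. All names here are illustrative, not from the paper; the stored targets in the actual method also include synthetic captions and ensemble outputs.

```python
import numpy as np

def build_reinforced_dataset(images, teacher_embed):
    """Precompute and store teacher knowledge (here, image embeddings)
    alongside the data, so student training pays no per-epoch teacher
    inference cost. A toy sketch of the dataset-reinforcement idea."""
    return [(img, teacher_embed(img)) for img in images]

def distill_loss(student_emb, stored_teacher_emb):
    # Cosine-distance distillation target against the stored embedding.
    s = student_emb / np.linalg.norm(student_emb)
    t = stored_teacher_emb / np.linalg.norm(stored_teacher_emb)
    return 1.0 - float(s @ t)

# Toy "teacher": a fixed random projection of the image vector.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
teacher = lambda img: W @ img

data = build_reinforced_dataset([rng.normal(size=4) for _ in range(3)], teacher)
img, target = data[0]
# A student that reproduces the teacher exactly incurs zero distillation loss.
assert distill_loss(teacher(img), target) < 1e-9
```

The point of the sketch is the data layout: once the teacher targets are stored with each example, any number of student training runs can reuse them without re-running the teacher.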
☆ LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
In this work, we present a novel method to tackle the token generation
challenge in Vision Language Models (VLMs) for video and image understanding,
called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning
and visual question answering, face computational burdens when processing long
videos due to the excessive number of visual tokens. LLaMA-VID addresses this issue by
representing each frame with two distinct tokens, namely context token and
content token. The context token encodes the overall image context based on
user input, whereas the content token encapsulates visual cues in each frame.
This dual-token strategy significantly reduces the overload of long videos
while preserving critical information. Generally, LLaMA-VID empowers existing
frameworks to support hour-long videos and pushes their upper limit with an
extra context token. LLaMA-VID is shown to surpass previous methods on most
video- and image-based benchmarks. Code is available at
https://github.com/dvlab-research/LLaMA-VID
comment: Code is available at https://github.com/dvlab-research/LLaMA-VID
☆ Efficient In-Context Learning in Vision-Language Models for Egocentric Videos
Recent advancements in text-only large language models (LLMs) have
highlighted the benefit of in-context learning for adapting to new tasks with a
few demonstrations. However, extending in-context learning to large
vision-language models (VLMs) using a huge amount of naturalistic
vision-language data has shown limited success, particularly for egocentric
videos, due to high data collection costs. We propose a novel training method
$\mathbb{E}$fficient $\mathbb{I}$n-context $\mathbb{L}$earning on
$\mathbb{E}$gocentric $\mathbb{V}$ideos ($\mathbb{EILEV}$), which elicits
in-context learning in VLMs for egocentric videos without requiring massive,
naturalistic egocentric video datasets. $\mathbb{EILEV}$ combines architectural
and training-data adaptations that allow the model to process contexts
interleaved with video clips and narrations; sampling of in-context examples
from clusters of similar verbs and nouns; and use of data with skewed marginal
distributions, a long tail of infrequent verbs and nouns, and homonyms and
synonyms. Our evaluations show that $\mathbb{EILEV}$-trained
models outperform larger VLMs trained on a huge amount of naturalistic data in
in-context learning. Furthermore, they can generalize to not only
out-of-distribution, but also novel, rare egocentric videos and texts via
in-context learning, demonstrating potential for applications requiring
cost-effective training, and rapid post-deployment adaptability. Our code and
demo are available at \url{https://github.com/yukw777/EILEV}.
☆ Scalable Extraction of Training Data from (Production) Language Models
Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee
This paper studies extractable memorization: training data that an adversary
can efficiently extract by querying a machine learning model without prior
knowledge of the training dataset. We show an adversary can extract gigabytes
of training data from open-source language models like Pythia or GPT-Neo,
semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing
techniques from the literature suffice to attack unaligned models; in order to
attack the aligned ChatGPT, we develop a new divergence attack that causes the
model to diverge from its chatbot-style generations and emit training data at a
rate 150x higher than when behaving properly. Our methods show practical
attacks can recover far more data than previously thought, and reveal that
current alignment techniques do not eliminate memorization.
☆ Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching NeurIPS 2023
Mechanistic interpretability aims to understand model behaviors in terms of
specific, interpretable features, often hypothesized to manifest as
low-dimensional subspaces of activations. Specifically, recent studies have
explored subspace interventions (such as activation patching) as a way to
simultaneously manipulate model behavior and attribute the features behind it
to given subspaces.
In this work, we demonstrate that these two aims diverge, potentially leading
to an illusory sense of interpretability. Counterintuitively, even if a
subspace intervention makes the model's output behave as if the value of a
feature was changed, this effect may be achieved by activating a dormant
parallel pathway leveraging another subspace that is causally disconnected from
model outputs. We demonstrate this phenomenon in a distilled mathematical
example, in two real-world domains (the indirect object identification task and
factual recall), and present evidence for its prevalence in practice. In the
context of factual recall, we further show a link to rank-1 fact editing,
providing a mechanistic explanation for previous work observing an
inconsistency between fact editing performance and fact localization.
However, this does not imply that activation patching of subspaces is
intrinsically unfit for interpretability. To contextualize our findings, we
also show what a success case looks like in a task (indirect object
identification) where prior manual circuit analysis informs an understanding of
the location of a feature. We explore the additional evidence needed to argue
that a patched subspace is faithful.
comment: NeurIPS 2023 Workshop on Attributing Model Behavior at Scale
☆ ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
Hailin Chen, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, Shafiq Joty
Upon its release in late 2022, ChatGPT has brought a seismic shift in the
entire landscape of AI, both in research and commerce. Through
instruction-tuning a large language model (LLM) with supervised fine-tuning and
reinforcement learning from human feedback, it showed that a model could answer
human questions and follow instructions on a broad panel of tasks. Following
this success, interest in LLMs has intensified, with new LLMs emerging at a
rapid pace across academia and industry, including many start-ups
focused on LLMs. While closed-source LLMs (e.g., OpenAI's GPT, Anthropic's
Claude) generally outperform their open-source counterparts, the progress on
the latter has been rapid, with claims of achieving parity or even superiority
on certain tasks. This has crucial implications not only for research but also for
business. In this work, on the first anniversary of ChatGPT, we provide an
exhaustive overview of this success, surveying all tasks where an open-source
LLM has been claimed to be on par with or better than ChatGPT.
☆ Assessing the influence of attractor-verb distance on grammatical agreement in humans and language models EMNLP 2023
Subject-verb agreement in the presence of an attractor noun located between
the main noun and the verb elicits complex behavior: judgments of
grammaticality are modulated by the grammatical features of the attractor. For
example, in the sentence "The girl near the boys likes climbing", the attractor
(boys) disagrees in grammatical number with the verb (likes), creating a
locally implausible transition probability. Here, we parametrically modulate
the distance between the attractor and the verb while keeping the length of the
sentence equal. We evaluate the performance of both humans and two artificial
neural network models: both make more mistakes when the attractor is closer to
the verb, but neural networks get close to the chance level while humans are
mostly able to overcome the attractor interference. Additionally, we report a
linear effect of attractor distance on reaction times. We hypothesize that a
possible reason for the proximity effect is the calculation of transition
probabilities between adjacent words. Nevertheless, classical models of
attraction such as the cue-based model might suffice to explain this
phenomenon, thus paving the way for new research. Data and analyses available
at https://osf.io/d4g6k
comment: 10 pages (5 main, 2 refs, 3 supplementary) ; 5 figures (3 main, 2
supplementary) ; accepted at EMNLP 2023 (no DOI yet)
☆ Natural Language Processing Through Transfer Learning: A Case Study on Sentiment Analysis
Artificial intelligence and machine learning have significantly bolstered the
technological world. This paper explores the potential of transfer learning in
natural language processing focusing mainly on sentiment analysis. The models
trained on large datasets can also be applied where data are scarce. The claim
is that transfer learning with pre-trained BERT models can increase sentiment
classification accuracy compared to training models from scratch. The
study adopts a sophisticated experimental design that uses the IMDb dataset of
sentimentally labelled movie reviews. Pre-processing includes tokenization and
encoding of text data, making it suitable for NLP models. The dataset is used
on a BERT-based model, measuring its performance using accuracy. The resulting
accuracy is 100 per cent. Although such perfect accuracy may appear
impressive, it might be the result of overfitting or a lack of
generalization. Further analysis is required to ensure the model's ability to
handle diverse and unseen data. The findings underscore the effectiveness of
transfer learning in NLP, showcasing its potential to excel in sentiment
analysis tasks. However, the research calls for a cautious interpretation of
perfect accuracy and emphasizes the need for additional measures to validate
the model's generalization.
comment: 12 pages, 1 table, 4 figures
☆ Debiasing Multimodal Models via Causal Information Minimization EMNLP 2023
Most existing debiasing methods for multimodal models, including causal
intervention and inference methods, utilize approximate heuristics to represent
the biases, such as shallow features from early stages of training or unimodal
features for multimodal tasks like VQA, etc., which may not be accurate. In
this paper, we study bias arising from confounders in a causal graph for
multimodal data and examine a novel approach that leverages causally-motivated
information minimization to learn the confounder representations. Robust
predictive features contain diverse information that helps a model generalize
to out-of-distribution data. Hence, minimizing the information content of
features obtained from a pretrained biased model helps learn the simplest
predictive features that capture the underlying data distribution. We treat
these features as confounder representations and use them via methods motivated
by causal theory to remove bias from models. We find that the learned
confounder representations indeed capture dataset biases, and the proposed
debiasing methods improve out-of-distribution (OOD) performance on multiple
multimodal datasets without sacrificing in-distribution performance.
Additionally, we introduce a novel metric to quantify the sufficiency of
spurious features in models' predictions that further demonstrates the
effectiveness of our proposed methods. Our code is available at:
https://github.com/Vaidehi99/CausalInfoMin
comment: EMNLP 2023 Findings (16 pages)
☆ Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Large Vision-Language Models (LVLMs) have advanced considerably, intertwining
visual recognition and language understanding to generate content that is not
only coherent but also contextually attuned. Despite their success, LVLMs still
suffer from the issue of object hallucinations, where models generate plausible
yet incorrect outputs that include objects that do not exist in the images. To
mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple
and training-free method that contrasts output distributions derived from
original and distorted visual inputs. The proposed VCD effectively reduces the
over-reliance on statistical bias and unimodal priors, two essential causes of
object hallucinations. This adjustment ensures the generated content is closely
grounded to visual inputs, resulting in contextually accurate outputs. Our
experiments show that VCD, without either additional training or the usage of
external tools, significantly mitigates the object hallucination issue across
different LVLM families. Beyond mitigating object hallucinations, VCD also
excels in general LVLM benchmarks, highlighting its wide-ranging applicability.
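The contrast at the core of the VCD abstract can be sketched in a few lines: amplify the logits conditioned on the original image against logits conditioned on a distorted one, so tokens favored only by language priors are down-weighted. This is a minimal toy sketch of the idea; the names and the exact adjustment form are assumptions, not the paper's formulation.

```python
import numpy as np

def contrastive_decode_step(logits_original, logits_distorted, alpha=1.0):
    """One decoding step of the visual-contrastive idea: boost the
    difference between logits conditioned on the original image and
    logits conditioned on a distorted (e.g. noised) image."""
    adjusted = (1 + alpha) * logits_original - alpha * logits_distorted
    # Softmax over the adjusted logits gives the sampling distribution.
    exp = np.exp(adjusted - adjusted.max())
    return exp / exp.sum()

# Toy vocabulary of 4 tokens. Token 2 is favored by the language prior
# (its logit stays high even with a distorted image); token 1 is
# visually grounded (its logit collapses without the real image).
orig = np.array([0.0, 2.0, 2.0, -1.0])
dist = np.array([0.0, 0.0, 2.0, -1.0])
probs = contrastive_decode_step(orig, dist, alpha=1.0)
assert probs.argmax() == 1  # the visually grounded token wins
```

Because tokens driven purely by the prior score similarly with and without the real image, the subtraction leaves mostly the visually grounded signal, which is why no extra training is needed.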
☆ Optimisation-Based Multi-Modal Semantic Image Editing
Image editing affords increased control over the aesthetics and content of
generated images. Pre-existing works focus predominantly on text-based
instructions to achieve desired image modifications, which limit edit precision
and accuracy. In this work, we propose an inference-time editing optimisation,
designed to extend beyond textual edits to accommodate multiple editing
instruction types (e.g. spatial layout-based; pose, scribbles, edge maps). We
propose to disentangle the editing task into two competing subtasks: successful
local image modifications and global content consistency preservation, where
subtasks are guided through two dedicated loss functions. By allowing the
influence of each loss function to be adjusted, we build a flexible editing solution that
can be adjusted to user preferences. We evaluate our method using text, pose
and scribble edit conditions, and highlight our ability to achieve complex
edits, through both qualitative and quantitative experiments.
☆ The Falcon Series of Open Language Models
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo
We introduce the Falcon series: 7B, 40B, and 180B parameter causal
decoder-only models trained on diverse, high-quality corpora predominantly
assembled from web data. The largest model, Falcon-180B, has been trained on
over 3.5 trillion tokens of text--the largest openly documented pretraining
run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla,
and improves upon concurrently developed models such as LLaMA 2 or
Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining
and inference cost, making it, to our knowledge, one of the three best language
models in the world along with GPT-4 and PaLM-2-Large. We report detailed
evaluations, as well as a deep dive into the methods and custom tooling
employed to pretrain Falcon. Notably, we report on our custom distributed
training codebase, allowing us to efficiently pretrain these models on up to
4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a
600B tokens extract of our web dataset, as well as the Falcon-7/40/180B models
under a permissive license to foster open-science and accelerate the
development of an open ecosystem of large language models.
☆ A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography
For sensible progress in natural language processing, it is important that we
are aware of the limitations of the evaluation metrics we use. In this work, we
evaluate how robust metrics are to non-standardized dialects, i.e. spelling
differences in language varieties that do not have a standard orthography. To
investigate this, we collect a dataset of human translations and human
judgments for automatic machine translations from English to two Swiss German
dialects. We further create a challenge set for dialect variation and benchmark
existing metrics' performances. Our results show that existing metrics cannot
reliably evaluate Swiss German text generation outputs, especially at the
segment level. We propose initial design adaptations that increase robustness in the
face of non-standardized dialects, although there remains much room for further
improvement. The dataset, code, and models are available here:
https://github.com/textshuttle/dialect_eval
comment: WMT 2023 Research Paper
☆ RELIC: Investigating Large Language Model Responses using Self-Consistency
Large Language Models (LLMs) are notorious for blending fact with fiction and
generating non-factual content, known as hallucinations. To tackle this
challenge, we propose an interactive system that helps users obtain insights
into the reliability of the generated text. Our approach is based on the idea
that the self-consistency of multiple samples generated by the same LLM relates
to its confidence in individual claims in the generated texts. Using this idea,
we design RELIC, an interactive system that enables users to investigate and
verify semantic-level variations in multiple long-form responses. This allows
users to recognize potentially inaccurate information in the generated text and
make necessary corrections. From a user study with ten participants, we
demonstrate that our approach helps users better verify the reliability of the
generated text. We further summarize the design implications and lessons
learned from this research for inspiring future studies on reliable human-LLM
interactions.
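The self-consistency idea behind RELIC, that confidence in a claim tracks how often independently sampled responses support it, can be sketched as follows. The paper compares claims at the semantic level; the substring stand-in and all names below are toy assumptions, not the system's implementation.

```python
def self_consistency_score(claim, samples, supports):
    """Estimate confidence in a claim as the fraction of independently
    sampled responses that support it (the core self-consistency idea)."""
    votes = sum(1 for s in samples if supports(claim, s))
    return votes / len(samples)

# Toy stand-in for semantic entailment: case-insensitive substring match.
supports = lambda claim, text: claim.lower() in text.lower()

samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital is Lyon.",
]
score = self_consistency_score("Paris", samples, supports)
assert abs(score - 2/3) < 1e-9  # two of three samples agree
```

A claim that only a minority of samples reproduce gets a low score and can be surfaced to the user as potentially unreliable.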
☆ The Claire French Dialogue Dataset
We present the Claire French Dialogue Dataset (CFDD), a resource created by
members of LINAGORA Labs in the context of the OpenLLM France initiative. CFDD
is a corpus containing roughly 160 million words from transcripts and stage
plays in French that we have assembled and publicly released in an effort to
further the development of multilingual, open source language models. This
paper describes the 24 individual corpora of which CFDD is composed and
provides links and citations to their original sources. It also provides our
proposed breakdown of the full CFDD dataset into eight categories of subcorpora
and describes the process we followed to standardize the format of the final
dataset. We conclude with a discussion of similar work and future directions.
☆ Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Multimodal large language models have made significant advancements in recent
years, yet they still suffer from a common issue known as the "hallucination
problem" where the models generate textual descriptions that contain inaccurate
or non-existent content from the image. To address this issue, this paper
introduces a novel strategy: Hallucination-Aware Direct Preference Optimization
(HA-DPO). Our approach treats the hallucination problem as a unique preference
selection issue, where the model is trained to favor the non-hallucinating
response when presented with two responses of the same image (one accurate and
one hallucinating). This paper also presents an efficient process for
constructing hallucination sample pairs to ensure high-quality,
style-consistent pairs for stable HA-DPO training. We applied this strategy to
two mainstream multimodal models, and the results showed a significant
reduction in the hallucination problem and an enhancement in the models'
generalization capabilities. With HA-DPO, the MiniGPT-4 model demonstrates
significant advancements: POPE accuracy increases from 51.13% to 85.66% (34.5%
absolute improvement), and the MME score escalates from 968.58 to 1365.76 (41%
relative improvement). The code, models, and datasets will be made publicly
available.
comment: Preprint
☆ CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models
Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Libiao Peng, Jiaming Yang, Xiyao Xiao, Sahand Sabour, Xiaohan Zhang, Wenjing Hou, Yijia Zhang, Yuxiao Dong, Jie Tang, Minlie Huang
In this paper, we present CharacterGLM, a series of models built upon
ChatGLM, with model sizes ranging from 6B to 66B parameters. Our CharacterGLM
is designed for generating Character-based Dialogues (CharacterDial), which
aims to equip a conversational AI system with character customization for
satisfying people's inherent social desires and emotional needs. On top of
CharacterGLM, we can customize various AI characters or social agents by
configuring their attributes (identities, interests, viewpoints, experiences,
achievements, social relationships, etc.) and behaviors (linguistic features,
emotional expressions, interaction patterns, etc.). Our model outperforms most
mainstream closed-source large language models, including the GPT series,
especially in terms of consistency, human-likeness, and engagement according to
manual evaluations. We will release our 6B version of CharacterGLM and a subset
of training data to facilitate further research and development in the direction of
character-based dialogue generation.
comment: Work in progress
☆ Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop
Large language models (LLMs) have become state of the art in many benchmarks
and conversational LLM applications like ChatGPT are now widely used by the
public. Those LLMs can be used to generate large amounts of content which is
posted on the internet to various platforms. As LLMs are trained on datasets
usually collected from the internet, this LLM-generated content might be used
to train the next generation of LLMs. Therefore, a self-consuming training loop
emerges in which new LLM generations are trained on the output from the
previous generations. We empirically study this self-consuming training loop
using a novel dataset to analytically and accurately measure quality and
diversity of generated outputs. We find that this self-consuming training loop
initially improves both quality and diversity. However, after a few generations
the output inevitably degenerates in diversity. We find that the rate of
degeneration depends on the proportion of real and generated data.
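The self-consuming loop described above can be mimicked with a toy model: each "generation" is just an empirical token distribution fitted to samples drawn from the previous one. The sketch and its names are illustrative assumptions; one property it makes concrete is that, without fresh real data, the support of the distribution can only shrink across generations.

```python
import random

def next_generation(dist, n_samples, rng):
    """Fit the next toy 'model' (an empirical token distribution) on
    samples drawn from the previous generation. With no real data mixed
    in, the vocabulary the model can produce never grows."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    samples = rng.choices(tokens, weights=weights, k=n_samples)
    counts = {}
    for t in samples:
        counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

real = {t: 1 / 10 for t in range(10)}  # uniform "real data" over 10 tokens
dist = dict(real)
supports = []
for _ in range(5):
    dist = next_generation(dist, n_samples=20, rng=random.Random(42))
    supports.append(len(dist))

# Diversity (here, support size) is monotonically non-increasing.
assert all(a >= b for a, b in zip(supports, supports[1:]))
```

Mixing a fraction of real samples back into each generation's training set would break this monotone collapse, which mirrors the abstract's observation that the degeneration rate depends on the proportion of real and generated data.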
☆ A Survey of the Evolution of Language Model-Based Dialogue Systems
Dialogue systems, including task-oriented dialogue systems (TOD) and
open-domain dialogue systems (ODD), have undergone significant transformations,
with language models (LMs) playing a central role. This survey delves into the
historical trajectory of dialogue systems, elucidating their intricate
relationship with advancements in language models by categorizing this
evolution into four distinct stages, each marked by pivotal LM breakthroughs:
1) Early stage: characterized by statistical LMs, resulting in rule-based or
machine-learning-driven dialogue systems; 2) Independent development of TOD and
ODD based on neural language models (NLMs; e.g., LSTM and GRU), since NLMs lack
intrinsic knowledge in their parameters; 3) Fusion between different types of
dialogue systems with the advent of pre-trained language models (PLMs),
starting from the fusion of the four sub-tasks within TOD, and then of TOD
with ODD; and 4) the current LLM-based dialogue systems, wherein LLMs can be
used to conduct TOD and ODD seamlessly. Thus, our survey provides a
chronological perspective aligned with LM breakthroughs, offering a
comprehensive review of state-of-the-art research outcomes. Moreover, we
focus on emerging topics and discuss open challenges, providing valuable
insights into future directions for LLM-based dialogue systems. Through this
exploration, we pave the way for a deeper comprehension of this evolution,
guiding future developments in LM-based dialogue systems.
☆ Evaluating Optimal Reference Translations
The overall translation quality reached by current machine translation (MT)
systems for high-resourced language pairs is remarkably good. Standard methods
of evaluation are neither suitable nor intended to uncover the many translation
errors and quality deficiencies that still persist. Furthermore, the quality of
standard reference translations is commonly questioned and comparable quality
levels have been reached by MT alone in several language pairs. Navigating
further research in these high-resource settings is thus difficult. In this
article, we propose a methodology for creating more reliable document-level
human reference translations, called "optimal reference translations," with the
simple aim to raise the bar of what should be deemed "human translation
quality." We evaluate the obtained document-level optimal reference
translations in comparison with "standard" ones, confirming a significant
quality increase and also documenting the relationship between evaluation and
translation editing.
comment: To appear in Natural Language Engineering 2024
☆ Radiology-Aware Model-Based Evaluation Metric for Report Generation
Amos Calamida, Farhad Nooralahzadeh, Morteza Rohanian, Koji Fujimoto, Mizuho Nishio, Michael Krauthammer
We propose a new automated evaluation metric for machine-generated radiology
reports using the successful COMET architecture adapted for the radiology
domain. We train and publish four medically-oriented model checkpoints,
including one trained on RadGraph, a radiology knowledge graph. Our results
show that our metric correlates moderately to highly with established metrics
such as BERTscore, BLEU, and CheXbert scores. Furthermore, we demonstrate that
one of our checkpoints exhibits a high correlation with human judgment, as
assessed using the publicly available annotations of six board-certified
radiologists, using a set of 200 reports. We also performed our own analysis
gathering annotations with two radiologists on a collection of 100 reports. The
results indicate the potential effectiveness of our method as a
radiology-specific evaluation metric. The code, data, and model checkpoints to
reproduce our findings will be publicly available.
comment: 9 pages
☆ LLMs for Science: Usage for Code Generation and Data Analysis
Large language models (LLMs) have been touted to enable increased
productivity in many areas of today's work life. Scientific research as an area
of work is no exception: the potential of LLM-based tools to assist in the
daily work of scientists has become a highly discussed topic across
disciplines. However, we are only at the very onset of this subject of study.
It is still unclear how the potential of LLMs will materialise in research
practice. With this study, we give first empirical evidence on the use of LLMs
in the research process. We have investigated a set of use cases for LLM-based
tools in scientific research, and conducted a first study to assess to which
degree current tools are helpful. In this paper we report specifically on use
cases related to software engineering, such as generating application code and
developing scripts for data analytics. While we studied seemingly simple use
cases, results across tools differ significantly. Our results highlight the
promise of LLM-based tools in general, yet we also observe various issues,
particularly regarding the integrity of the output these tools provide.
comment: Preprint; In Submission
☆ Entity-Aspect-Opinion-Sentiment Quadruple Extraction for Fine-grained Sentiment Analysis
Product reviews often contain a large number of implicit aspects and
object-attribute co-existence cases. Unfortunately, many existing studies in
Aspect-Based Sentiment Analysis (ABSA) have overlooked this issue, which can
make it difficult to extract opinions comprehensively and fairly. In this
paper, we propose a new task called Entity-Aspect-Opinion-Sentiment Quadruple
Extraction (EASQE), which aims to hierarchically decompose aspect terms into
entities and aspects to avoid information loss, non-exclusive annotations, and
opinion misunderstandings in ABSA tasks. To facilitate research in this new
task, we have constructed four datasets (Res14-EASQE, Res15-EASQE, Res16-EASQE,
and Lap14-EASQE) based on the SemEval Restaurant and Laptop datasets. We have
also proposed a novel two-stage sequence-tagging based Trigger-Opinion
framework as the baseline for the EASQE task. Empirical evaluations show that
our Trigger-Opinion framework can generate satisfactory EASQE results and can
also be applied to other ABSA tasks, significantly outperforming
state-of-the-art methods. We have made the four datasets and source code of
Trigger-Opinion publicly available to facilitate further research in this area.
☆ A Distribution-Based Threshold for Determining Sentence Similarity
We hereby present a solution to a semantic textual similarity (STS) problem
in which it is necessary to match two sentences containing, as the only
distinguishing factor, highly specific information (such as names, addresses,
identification codes), and from which we need to derive a definition for when
they are similar and when they are not. The solution revolves around the use of
a neural network, based on the siamese architecture, to create the
distributions of the distances between similar and dissimilar pairs of
sentences. The goal of these distributions is to find a discriminating factor,
that we call "threshold", which represents a well-defined quantity that can be
used to distinguish vector distances of similar pairs from vector distances of
dissimilar pairs in new predictions and later analyses. In addition, we
developed a way to score the predictions by combining attributes from both the
distributions' features and the way the distance function works. Finally, we
generalize the results showing that they can be transferred to a wider range of
domains by applying the system discussed to a well-known and widely used
benchmark dataset for STS problems.
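The threshold-finding step described above can be sketched directly: given the observed distances for similar and dissimilar pairs, sweep candidate thresholds and keep the one minimizing misclassifications. The selection criterion and all names below are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def best_threshold(sim_dists, dis_dists):
    """Sweep candidate thresholds over observed pair distances and pick
    the one minimizing errors: similar pairs above the threshold plus
    dissimilar pairs at or below it."""
    candidates = np.sort(np.concatenate([sim_dists, dis_dists]))
    best_t, best_err = None, float("inf")
    for t in candidates:
        err = int((sim_dists > t).sum() + (dis_dists <= t).sum())
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

sim = np.array([0.1, 0.2, 0.25, 0.3])   # distances between similar pairs
dis = np.array([0.7, 0.8, 0.9, 0.95])   # distances between dissimilar pairs
t, err = best_threshold(sim, dis)
assert err == 0  # well-separated distributions admit a perfect threshold
```

Once fixed, the threshold classifies new pairs by a single distance comparison, and its margin to each distribution can be used to score prediction confidence, in the spirit of the scoring the abstract describes.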
☆ Text2Tree: Aligning Text Representation to the Label Tree Hierarchy for Imbalanced Medical Classification EMNLP 2023
Deep learning approaches exhibit promising performances on various text
tasks. However, they are still struggling on medical text classification since
samples are often extremely imbalanced and scarce. Different from existing
mainstream approaches that focus on supplementary semantics with external
medical information, this paper aims to rethink the data challenges in medical
texts and present a novel framework-agnostic algorithm called Text2Tree that
only utilizes internal label hierarchy in training deep learning models. We
embed the ICD code tree structure of labels into cascade attention modules for
learning hierarchy-aware label representations. Two new learning schemes,
Similarity Surrogate Learning (SSL) and Dissimilarity Mixup Learning (DML), are
devised to boost text classification by reusing and distinguishing samples of
other labels following the label representation hierarchy, respectively.
Experiments on authoritative public datasets and real-world medical records
show that our approach consistently achieves superior performance over
classical and advanced imbalanced classification methods.
comment: EMNLP 2023 Findings. Code: https://github.com/jyansir/Text2Tree
☆ Scaling Political Texts with ChatGPT
We use GPT-4 to obtain position estimates of political texts in continuous
spaces. We develop and validate a new approach by positioning British party
manifestos on the economic, social, and immigration policy dimensions and
tweets by members of the US Congress on the left-right ideological spectrum.
For the party manifestos, the correlation between the positions produced by
GPT-4 and experts is 93% or higher, a performance similar to or better than
that obtained with crowdsourced position estimates. For individual tweets, the
positions obtained with GPT-4 achieve a correlation of 91% with crowdsourced
position estimates. For senators of the 117th US Congress, the positions
obtained with GPT-4 achieve a correlation of 97% with estimates based on roll
call votes and of 96% with those based on campaign funding. Correlations are
also substantial within party, indicating that position estimates produced with
GPT-4 capture within-party differences between senators. Overall, using GPT-4
for ideological scaling is fast, cost-efficient, and reliable. This approach
provides a viable alternative to scaling by both expert raters and
crowdsourcing.
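As a rough illustration of this style of LLM-based scaling (not necessarily
the authors' exact prompt or protocol), one can ask a model for a number on a
bounded scale and average repeated samples; `toy_llm` below is a hypothetical
stand-in for an actual GPT-4 call.

```python
import statistics

# assumed prompt wording for illustration only
PROMPT = ("On a scale from -1 (strongly left) to 1 (strongly right), "
          "where does the following text stand on economic policy? "
          "Reply with a number only.\n\nText: {text}")

def position_estimate(llm, text, repeats=3):
    """Average several sampled scores to stabilize the estimate."""
    scores = [float(llm(PROMPT.format(text=text))) for _ in range(repeats)]
    return statistics.mean(scores)

# hypothetical stand-in for an actual GPT-4 call
def toy_llm(prompt):
    return "0.4" if "lower taxes" in prompt else "-0.5"

estimate = position_estimate(toy_llm, "We will lower taxes.")
```

Averaging repeats is one simple way to obtain the continuous positions that
are then correlated against expert and crowdsourced estimates.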
☆ On the Long Range Abilities of Transformers
Despite their dominance in modern deep learning and, especially, NLP domains,
transformer architectures exhibit sub-optimal performance on long-range tasks
compared to recent layers that are specifically designed for this purpose. In
this work, drawing inspiration from key attributes of long-range layers, such
as state-space layers, linear RNN layers, and global convolution layers, we
demonstrate that minimal modifications to the transformer architecture can
significantly enhance performance on the Long Range Arena (LRA) benchmark, thus
narrowing the gap with these specialized layers. We identify that two key
principles for long-range tasks are (i) incorporating an inductive bias towards
smoothness, and (ii) locality. As we show, integrating these ideas into the
attention mechanism improves results with a negligible amount of additional
computation and without any additional trainable parameters. Our theory and
experiments also shed light on the reasons for the inferior performance of
transformers on long-range tasks and identify critical properties that are
essential for successfully capturing long-range dependencies.
comment: 18 pages
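A minimal sketch of how such parameter-free biases could be folded into
attention, assuming a simple additive distance penalty for locality and a
neighbor-averaging blur of the attention weights for smoothness. This is one
plausible instantiation for illustration, not necessarily the paper's exact
formulation.

```python
import numpy as np

def local_smooth_attention(q, k, v, alpha=0.1):
    """Attention with two parameter-free biases: a locality penalty on
    distant positions and a smoothness blur over adjacent key weights."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    scores = scores - alpha * dist          # locality: no new parameters
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    # smoothness: average each attention row over neighboring keys
    padded = np.pad(w, ((0, 0), (1, 1)), mode="edge")
    w = (padded[:, :-2] + padded[:, 1:-1] + padded[:, 2:]) / 3
    w = w / w.sum(axis=-1, keepdims=True)   # renormalize after the blur
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out = local_smooth_attention(q, k, v)
```

Note that both biases add negligible computation and no trainable parameters,
matching the constraint the abstract emphasizes.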
☆ MedGen: A Python Natural Language Processing Toolkit for Medical Text Processing
Rui Yang, Qingcheng Zeng, Keen You, Yujie Qiao, Lucas Huang, Chia-Chun Hsieh, Benjamin Rosand, Jeremy Goldwasser, Amisha D Dave, Tiarnan D. L. Keenan, Emily Y Chew, Dragomir Radev, Zhiyong Lu, Hua Xu, Qingyu Chen, Irene Li
This study introduces MedGen, a comprehensive natural language processing
(NLP) toolkit designed for medical text processing. MedGen offers biomedical
researchers and healthcare professionals an easy-to-use, all-in-one solution
that requires minimal programming expertise. It includes
(1) Generative Functions: For the first time, MedGen includes four advanced
generative functions: question answering, text summarization, text
simplification, and machine translation; (2) Basic NLP Functions: MedGen
integrates 12 essential NLP functions such as word tokenization and sentence
segmentation; and (3) Query and Search Capabilities: MedGen provides
user-friendly query and search functions on text corpora. We fine-tuned 32
domain-specific language models, evaluated them thoroughly on 24 established
benchmarks and conducted manual reviews with clinicians. Additionally, we
expanded our toolkit by introducing query and search functions, while also
standardizing and integrating functions from third-party libraries. The
toolkit, its models, and associated data are publicly available via
https://github.com/Yale-LILY/MedGen.
comment: 5 figures, 4 tables
☆ Recognizing Conditional Causal Relationships about Emotions and Their Corresponding Conditions
The study of causal relationships between emotions and causes in texts has
recently received much attention. Most works focus on extracting causally
related clauses from documents. However, none of these works considers that the
causal relationships between the extracted emotion and cause clauses may be
valid only under specific context clauses. To highlight the context in
such special causal relationships, we propose a new task to determine whether
or not an input pair of emotion and cause has a valid causal relationship under
different contexts and extract the specific context clauses that participate in
the causal relationship. Since the task is new and no existing dataset is
available, we manually annotate a benchmark dataset to obtain labels for our
tasks, along with annotations of each context clause's type that can also be
used in other applications. We adopt negative sampling to
construct the final dataset to balance the number of documents with and without
causal relationships. Based on the constructed dataset, we propose an
end-to-end multi-task framework, where we design two novel and general modules
to handle the two goals of our task. Specifically, we propose a context masking
module to extract the context clauses participating in the causal
relationships. We propose a prediction aggregation module to fine-tune the
prediction results according to whether the input emotion and causes depend on
specific context clauses. Results of extensive comparative experiments and
ablation studies demonstrate the effectiveness and generality of our proposed
framework.
☆ Evaluation of dynamic characteristics of power grid based on GNN and application on knowledge graph
A novel method for detecting faults in power grids using a graph neural
network (GNN) has been developed, aimed at enhancing intelligent fault
diagnosis in network operation and maintenance. This GNN-based approach
identifies faulty nodes within the power grid through a specialized electrical
feature extraction model coupled with a knowledge graph. Incorporating temporal
data, the method leverages the status of nodes from preceding and subsequent
time periods to aid in current fault detection. To validate the effectiveness
of this GNN in extracting node features, a correlation analysis of the output
features from each node within the neural network layer was conducted. The
results from experiments show that this method can accurately locate fault
nodes in simulated scenarios with a remarkable 99.53% accuracy. Additionally,
the graph neural network's feature modeling allows for a qualitative
examination of how faults spread across nodes, providing valuable insights for
analyzing fault nodes.
☆ StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models ICASSP 2024
We propose StyleCap, a method to generate natural language descriptions of
speaking styles appearing in speech. Most conventional techniques for
para-/non-linguistic information recognition focus on classifying categories or
estimating the intensity of pre-defined labels, and thus cannot explain their
recognition results in an interpretable manner. As
a first step towards an end-to-end method for generating speaking-style prompts
from speech, i.e., automatic speaking-style captioning, StyleCap uses paired
data of speech and natural language descriptions to train neural networks that
predict prefix vectors fed into a large language model (LLM)-based text decoder
from a speech representation vector. We explore an appropriate text decoder and
speech feature representation suitable for this new task. The experimental
results demonstrate that our StyleCap leveraging richer LLMs for the text
decoder, speech self-supervised learning (SSL) features, and sentence
rephrasing augmentation improves the accuracy and diversity of generated
speaking-style captions. Samples of speaking-style captions generated by our
StyleCap are publicly available.
comment: Submitted to ICASSP 2024
☆ Enhancing Human Persuasion With Large Language Models
Although large language models (LLMs) are reshaping various aspects of human
life, our current understanding of their impacts remains somewhat constrained.
Here we investigate the impact of LLMs on human communication, in the context
of consumer complaints in the financial industry. Employing an AI detection
tool on more than 780K complaints gathered by the Consumer Financial Protection
Bureau (CFPB), we find evidence of LLM usage in the writing of complaints
shortly after the release of ChatGPT. Our analyses reveal that LLM usage is
positively correlated with the likelihood of obtaining desirable outcomes
(i.e., offer of relief from financial firms) and suggest that this positive
correlation may be partly due to the linguistic features improved by LLMs. We
test this conjecture with a preregistered experiment, which reveals results
consistent with those from observational studies: Consumer complaints written
with ChatGPT for improved linguistic qualities were more likely to receive
hypothetical relief offers than the original consumer complaints, demonstrating
the LLM's ability to enhance message persuasiveness in human communication.
Being some of the earliest empirical evidence on LLM usage for enhancing
persuasion, our results highlight the transformative potential of LLMs in human
communication.
☆ Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz
Generalist foundation models such as GPT-4 have displayed surprising
capabilities in a wide variety of domains and tasks. Yet, there is a prevalent
assumption that they cannot match specialist capabilities of fine-tuned models.
For example, most explorations to date on medical competency benchmarks have
leveraged domain-specific training, as exemplified by efforts on BioGPT and
Med-PaLM. We build on a prior study of GPT-4's capabilities on medical
challenge benchmarks in the absence of special training. Rather than using
simple prompting to highlight the model's out-of-the-box capabilities, we
perform a systematic exploration of prompt engineering. We find that prompting
innovation can unlock deeper specialist capabilities and show that GPT-4 easily
tops prior leading results for medical benchmarks. The prompting methods we
explore are general purpose, and make no specific use of domain expertise,
removing the need for expert-curated content. Our experimental design carefully
controls for overfitting during the prompt engineering process. We introduce
Medprompt, based on a composition of several prompting strategies. With
Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark
datasets in the MultiMedQA suite. The method outperforms leading specialist
models such as Med-PaLM 2 by a significant margin with an order of magnitude
fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27%
reduction in error rate on the MedQA dataset over the best methods to date
achieved with specialist models and surpasses a score of 90% for the first
time. Beyond medical problems, we show the power of Medprompt to generalize to
other domains and provide evidence for the broad applicability of the approach
via studies of the strategy on exams in electrical engineering, machine
learning, philosophy, accounting, law, nursing, and clinical psychology.
comment: 21 pages, 7 figures
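One documented ingredient of Medprompt is choice-shuffle ensembling. The
sketch below illustrates only that idea, with a hypothetical `ask_model`
callable; the full pipeline also combines kNN-selected few-shot examples and
chain-of-thought prompting.

```python
import random
from collections import Counter

def choice_shuffle_ensemble(ask_model, question, options, k=5, seed=0):
    """Query the model k times with the answer options in different
    orders, then majority-vote on the option text (not the letter),
    which reduces position bias in multiple-choice answers."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(k):
        shuffled = options[:]
        rng.shuffle(shuffled)
        letter = ask_model(question, shuffled)      # e.g. "B"
        votes[shuffled[ord(letter) - ord("A")]] += 1
    return votes.most_common(1)[0][0]

# hypothetical stand-in model that always identifies the right option
def toy_model(question, options):
    return chr(ord("A") + options.index("Paris"))

answer = choice_shuffle_ensemble(toy_model, "Capital of France?",
                                 ["Paris", "Rome", "Madrid", "Berlin"])
```

Voting on option text rather than option letters is what makes the ensemble
robust to the shuffled orderings.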
☆ Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato
We propose a novel benchmark for cross-view knowledge transfer of dense video
captioning, adapting models from web instructional videos with exocentric views
to an egocentric view. While dense video captioning (predicting time segments
and their captions) is primarily studied with exocentric videos (e.g.,
YouCook2), benchmarks with egocentric videos are restricted due to data
scarcity. To overcome this limited video availability, transferring knowledge
from abundant exocentric web videos is a practical approach.
However, learning the correspondence between exocentric and egocentric views is
difficult due to their dynamic view changes. The web videos contain mixed views
focusing on either human body actions or close-up hand-object interactions,
while the egocentric view is constantly shifting as the camera wearer moves.
This necessitates the in-depth study of cross-view transfer under complex view
changes. In this work, we first create a real-life egocentric dataset (EgoYC2)
whose captions are shared with YouCook2, enabling transfer learning between
these datasets assuming their ground-truth is accessible. To bridge the view
gaps, we propose a view-invariant learning method using adversarial training in
both the pre-training and fine-tuning stages. While the pre-training is
designed to learn invariant features against the mixed views in the web videos,
the view-invariant fine-tuning further mitigates the view gaps between both
datasets. We validate our proposed method by studying how effectively it
overcomes the view change problem and efficiently transfers the knowledge to
the egocentric domain. Our benchmark pushes the study of cross-view transfer
into the new task domain of dense video captioning and will inspire
methodologies for describing egocentric videos in natural language.
☆ CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models
As the scaling of Large Language Models (LLMs) has dramatically enhanced
their capabilities, there has been a growing focus on the alignment problem to
ensure their responsible and ethical use. While existing alignment efforts
predominantly concentrate on universal values such as the HHH principle, the
aspect of culture, which is inherently pluralistic and diverse, has not
received adequate attention. This work introduces a new benchmark, CDEval,
aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by
incorporating both GPT-4's automated generation and human verification,
covering six cultural dimensions across seven domains. Our comprehensive
experiments provide intriguing insights into the culture of mainstream LLMs,
highlighting both consistencies and variations across different dimensions and
domains. The findings underscore the importance of integrating cultural
considerations in LLM development, particularly for applications in diverse
cultural settings. Through CDEval, we aim to broaden the horizon of LLM
alignment research by including cultural dimensions, thus providing a more
holistic framework for the future development and evaluation of LLMs. This
benchmark serves as a valuable resource for cultural studies in LLMs, paving
the way for more culturally aware and sensitive models.
comment: Work in progress
♻ ☆ What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
Counterfactual reasoning, a fundamental aspect of human cognition, involves
contemplating alternatives to established facts or past events, significantly
enhancing our abilities in planning and decision-making. In light of the
advancements in current multi-modal large language models, we explore their
effectiveness in counterfactual reasoning. To facilitate this investigation, we
introduce a novel dataset, C-VQA, specifically designed to test the
counterfactual reasoning capabilities of modern multi-modal large language
models. This dataset is constructed by infusing original questions with
counterfactual presuppositions, spanning various types such as numerical and
boolean queries. It encompasses a mix of real and synthetic data, representing
a wide range of difficulty levels. Our thorough evaluations of contemporary
vision-language models using this dataset have revealed substantial performance
drops, with some models showing up to a 40% decrease, highlighting a
significant gap between current models and human-like vision reasoning
capabilities. We hope our dataset will serve as a vital benchmark for
evaluating the counterfactual reasoning capabilities of models. Code and
dataset are publicly available at https://bzhao.me/C-VQA/.
♻ ☆ The effect of source disclosure on evaluation of AI-generated messages: A two-part study
Advancements in artificial intelligence (AI) over the last decade demonstrate
that machines can exhibit communicative behavior and influence how humans
think, feel, and behave. In fact, the recent development of ChatGPT has shown
that large language models (LLMs) can be leveraged to generate high-quality
communication content at scale and across domains, suggesting that they will be
increasingly used in practice. However, many questions remain about how knowing
the source of the messages influences recipients' evaluation of and preference
for AI-generated messages compared to human-generated messages. This paper
investigated this topic in the context of vaping prevention messaging. In Study
1, which was pre-registered, we examined the influence of source disclosure on
people's evaluation of AI-generated health prevention messages compared to
human-generated messages. We found that source disclosure (i.e., labeling the
source of a message as AI vs. human) significantly impacted the evaluation of
the messages but did not significantly alter message rankings. In a follow-up
study (Study 2), we examined how the influence of source disclosure may vary by
the participants' negative attitudes towards AI. We found a significant
moderating effect of negative attitudes towards AI on message evaluation, but
not for message selection. However, for those with moderate levels of negative
attitudes towards AI, source disclosure decreased the preference for
AI-generated messages. Overall, the results of this series of studies showed a
slight bias against AI-generated messages once the source was disclosed, adding
to the emerging area of study that lies at the intersection of AI and
communication.
comment: Manuscript currently under review. Paper presented at 109th Annual
National Communication Association (NCA) Conference, November 16-19, 2023. 10
pages, 5 figures. Supplementary file formatting updated in current version
♻ ☆ A Brief History of Prompt: Leveraging Language Models. (Through Advanced Prompting)
This paper presents a comprehensive exploration of the evolution of prompt
engineering and generation in the field of natural language processing (NLP).
Starting from the early language models and information retrieval systems, we
trace the key developments that have shaped prompt engineering over the years.
The introduction of attention mechanisms in 2015 revolutionized language
understanding, leading to advancements in controllability and
context-awareness. Subsequent breakthroughs in reinforcement learning
techniques further enhanced prompt engineering, addressing issues like exposure
bias and biases in generated text. We examine the significant contributions in
2018 and 2019, focusing on fine-tuning strategies, control codes, and
template-based generation. The paper also discusses the growing importance of
fairness, human-AI collaboration, and low-resource adaptation. In 2020 and
2021, contextual prompting and transfer learning gained prominence, while 2022
and 2023 witnessed the emergence of advanced techniques like unsupervised
pre-training and novel reward shaping. Throughout the paper, we reference
specific research studies that exemplify the impact of various developments on
prompt engineering. The journey of prompt engineering continues, with ethical
considerations being paramount for the responsible and inclusive future of AI
systems.
♻ ☆ People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection EMNLP'23
Indira Sen, Dennis Assenmacher, Mattia Samory, Isabelle Augenstein, Wil van der Aalst, Claudia Wagner
NLP models are used in a variety of critical social computing tasks, such as
detecting sexist, racist, or otherwise hateful content. Therefore, it is
imperative that these models are robust to spurious features. Past work has
attempted to tackle such spurious features using training data augmentation,
including Counterfactually Augmented Data (CADs). CADs introduce minimal
changes to existing training data points and flip their labels; training on
them may reduce model dependency on spurious features. However, manually
generating CADs can be time-consuming and expensive. Hence, in this work, we
assess whether this task can be automated using generative NLP models. We
automatically generate CADs using Polyjuice, ChatGPT, and Flan-T5, and evaluate
their usefulness in improving model robustness compared to manually-generated
CADs. By testing both model performance on multiple out-of-domain test sets and
individual data point efficacy, our results show that while manual CADs are
still the most effective, CADs generated by ChatGPT come a close second. One
key reason for the lower performance of automated methods is that the changes
they introduce are often insufficient to flip the original label.
comment: Preprint of EMNLP'23 paper
♻ ☆ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Quantization is an indispensable technique for serving Large Language Models
(LLMs) and has recently found its way into LoRA fine-tuning. In this work we
focus on the scenario where quantization and LoRA fine-tuning are applied
together on a pre-trained model. In such cases, it is common to observe a
consistent gap in downstream-task performance between full fine-tuning and the
quantization-plus-LoRA fine-tuning approach. In response, we propose LoftQ
(LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that
simultaneously quantizes an LLM and finds a proper low-rank initialization for
LoRA fine-tuning. Such an initialization alleviates the discrepancy between the
quantized and full-precision model and significantly improves generalization in
downstream tasks. We evaluate our method on natural language understanding,
question answering, summarization, and natural language generation tasks.
Experiments show that our method is highly effective and outperforms existing
quantization methods, especially in the challenging 2-bit and 2/4-bit mixed
precision regimes. The code is available at https://github.com/yxli2123/LoftQ.
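The alternating objective behind LoftQ (jointly fitting a quantized matrix
plus a low-rank correction to the pre-trained weights) can be sketched as
follows; the toy uniform quantizer here is only a stand-in for the NormalFloat
quantizers used in practice, and the function names are ours.

```python
import numpy as np

def uniform_quant(w, bits=2):
    # toy symmetric uniform quantizer (a stand-in for NF2/NF4 in LoftQ)
    levels = 2 ** bits - 1
    scale = np.abs(w).max() / (levels / 2) + 1e-12
    return np.round(w / scale) * scale

def loftq_init(W, rank=4, iters=5, bits=2):
    """Alternate a quantization step and an SVD low-rank fit so that
    Q + A @ B approximates the pre-trained weight matrix W."""
    A = np.zeros((W.shape[0], rank))
    B = np.zeros((rank, W.shape[1]))
    best = None
    for _ in range(iters):
        Q = uniform_quant(W - A @ B, bits)
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        A, B = U[:, :rank] * S[:rank], Vt[:rank]
        err = np.linalg.norm(W - (Q + A @ B))
        if best is None or err < best[0]:
            best = (err, Q, A, B)
    return best[1], best[2], best[3]

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
Q, A, B = loftq_init(W)
# the joint Q + AB fit beats plain quantization of W alone
err_plain = np.linalg.norm(W - uniform_quant(W))
err_loftq = np.linalg.norm(W - (Q + A @ B))
```

The low-rank factors A and B then serve as the LoRA initialization, which is
what shrinks the gap to full fine-tuning that the abstract describes.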
♻ ☆ GraphPrompt: Graph-Based Prompt Templates for Biomedical Synonym Prediction
Hanwen Xu, Jiayou Zhang, Zhirui Wang, Shizhuo Zhang, Megh Manoj Bhalerao, Yucong Liu, Dawei Zhu, Sheng Wang
As biomedical datasets expand, the same category may be labeled with different
terms, which are tedious and onerous to curate. It is therefore desirable to
automatically map synonymous terms onto ontologies, a task we name biomedical
synonym prediction. Unlike
biomedical concept normalization (BCN), no clues from context can be used to
enhance synonym prediction, making it essential to extract graph features from
ontology. We introduce an expert-curated dataset OBO-syn encompassing 70
different types of concepts and 2 million curated concept-term pairs for
evaluating synonym prediction methods. We find that BCN methods perform weakly
on this task because they do not make full use of graph information. Therefore,
we propose
GraphPrompt, a prompt-based learning approach that creates prompt templates
according to the graphs. GraphPrompt obtains 37.2\% and 28.5\% improvements in
the zero-shot and few-shot settings, respectively, indicating the effectiveness
of
these graph-based prompt templates. We envision that our method GraphPrompt and
OBO-syn dataset can be broadly applied to graph-based NLP tasks, and serve as
the basis for analyzing diverse and accumulating biomedical data. All data and
code are available at: https://github.com/HanwenXuTHU/GraphPrompt
comment: 7 pages
♻ ☆ Extending CAM-based XAI methods for Remote Sensing Imagery Segmentation
Current AI-based methods do not provide comprehensible physical
interpretations of the utilized data, extracted features, and
predictions/inference operations. As a result, deep learning models trained
using high-resolution satellite imagery lack transparency and explainability
and can be merely seen as a black box, which limits their wide-level adoption.
Experts need help understanding the complex behavior of AI models and the
underlying decision-making process. The explainable artificial intelligence
(XAI) field is an emerging field providing means for robust, practical, and
trustworthy deployment of AI models. Several XAI techniques have been proposed
for image classification tasks, whereas the interpretation of image
segmentation remains largely unexplored. This paper bridges this gap by
adapting recent XAI classification algorithms and making them usable for
multi-class image segmentation, where we mainly focus on building segmentation
from high-resolution satellite images. To benchmark and compare the performance
of the proposed approaches, we introduce a new XAI evaluation methodology and
metric based on "Entropy" to measure the model uncertainty. Conventional XAI
evaluation methods rely mainly on feeding area-of-interest regions from the
image back to the pre-trained (utility) model and then calculating the average
change in the probability of the target class. Those evaluation metrics lack
the needed robustness, and we show that using Entropy to monitor the model
uncertainty in segmenting the pixels within the target class is more suitable.
We hope this work will pave the way for additional XAI research for image
segmentation and applications in the remote sensing discipline.
♻ ☆ Patent Documents to Engineering Design Knowledge Graphs
Aimed at supporting knowledge-intensive tasks in the design process,
populating design knowledge from text documents involves the extraction of
triples - head entity :: relationship :: tail entity or h :: r :: t that could
be combined into a knowledge graph representation. As relationships are largely
chosen from ontological or common-sense alternatives, knowledge graphs built
using these depict an approximation or restricted view of design knowledge
rather than what is explicated in the text documents. In this article, we
present a
data-driven approach to identify and explicate facts (h :: r :: t) from
sentences in patent documents. We create a dataset of 44,227 sentences and
facts, encompassing all patent classifications while also capturing the
variations among patent document sections. Using this dataset, we train taggers
that classify tokens to: 1) identify all entities (h) and relationships (r) and
2) specific relationships (r) for a pair of entities (h :: ___ :: t). While
these taggers are built upon transformer-based sequence classification models,
we evaluate our proposed method against edge classification approaches that use
linear classifiers and graph neural networks, incorporating transformer-based
token embeddings and linguistic features. The simplicity and coverage of the
proposed method enable its application to patent documents at any scale and
variety. Upon deploying an open-source python package, we apply our method to
patent documents related to fan systems. From the knowledge graphs thus
extracted, we explain how facts could be generalised to domain ontologies as
well as be specified to subsystem levels. We also highlight the importance of
knowledge graph representations by retrieving and explicating the knowledge of
key issues in fan systems, while holding a comparative discussion against
opinions from ChatGPT.
♻ ☆ A Survey of Graph Meets Large Language Model: Progress and Future Directions
Graph plays a significant role in representing and analyzing complex
relationships in real-world applications such as citation networks, social
networks, and biological data. Recently, Large Language Models (LLMs), which
have achieved tremendous success in various domains, have also been leveraged
in graph-related tasks to surpass traditional Graph Neural Networks (GNNs)
based methods and yield state-of-the-art performance. In this survey, we first
present a comprehensive review and analysis of existing methods that integrate
LLMs with graphs. First of all, we propose a new taxonomy, which organizes
existing methods into three categories based on the role (i.e., enhancer,
predictor, and alignment component) played by LLMs in graph-related tasks. Then
we systematically survey the representative methods along the three categories
of the taxonomy. Finally, we discuss the remaining limitations of existing
studies and highlight promising avenues for future research. The relevant
papers are summarized and will be consistently updated at:
https://github.com/yhLeeee/Awesome-LLMs-in-Graph-tasks.
comment: Work in progress; 13 pages, 5 figures
♻ ☆ Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond MDM
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Tianyu Liu, Baobao Chang
In this study, we explore the potential of Multimodal Large Language Models
(MLLMs) in improving embodied decision-making processes for agents. While Large
Language Models (LLMs) have been widely used due to their advanced reasoning
skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual
understanding and reasoning capabilities. We investigate whether
state-of-the-art MLLMs can handle embodied decision-making in an end-to-end
manner and whether collaborations between LLMs and MLLMs can enhance
decision-making. To address these questions, we introduce a new benchmark
called PCA-EVAL, which evaluates embodied decision-making from the perspectives
of Perception, Cognition, and Action. Additionally, we propose HOLMES, a
multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs
to gather multimodal information for informed decision-making. We compare
end-to-end embodied decision-making and HOLMES on our benchmark and find that
the GPT4-Vision model demonstrates strong end-to-end embodied decision-making
abilities, outperforming GPT4-HOLMES in terms of average decision accuracy
(+3%). However, this performance is exclusive to the latest GPT4-Vision model,
which surpasses the open-source state-of-the-art MLLM by 26%. Our results
indicate
that powerful MLLMs like GPT4-Vision hold promise for decision-making in
embodied agents, offering new avenues for MLLM research. Code and data are open
at https://github.com/pkunlp-icler/PCA-EVAL/.
comment: FMDM@NeurIPS2023, Code and data:
https://github.com/pkunlp-icler/PCA-EVAL/
♻ ☆ Just ClozE! A Novel Framework for Evaluating the Factual Consistency Faster in Abstractive Summarization
The issue of factual consistency in abstractive summarization has received
extensive attention in recent years, and the evaluation of factual consistency
between summary and document has become an important and urgent task. Most of
the current evaluation metrics are adopted from the question answering (QA) or
natural language inference (NLI) tasks. However, QA-based metrics are
extremely time-consuming to apply in practice, while NLI-based metrics lack
interpretability. In this paper, we propose a cloze-based evaluation
framework called ClozE and show the great potential of the cloze-based metric.
It inherits strong interpretability from QA, while maintaining the speed of
NLI-level reasoning. We demonstrate that ClozE can reduce the evaluation time
by nearly 96% relative to QA-based metrics while retaining their
interpretability and performance through experiments on six human-annotated
datasets and a meta-evaluation benchmark GO FIGURE (Gabriel et al., 2021).
Finally, we discuss three important facets of ClozE in practice, which further
shows better overall performance of ClozE compared to other metrics.
comment: The manuscript for JAIR
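A ClozE-style check can be sketched in miniature. Everything below is an assumption for illustration, not the paper's implementation: factual spans are masked with a crude capitalized-word heuristic, and `fill_cloze` is a toy filler that copies candidates from the document, standing in for the paper's cloze model.

```python
import re

def mask_factual_spans(summary):
    """Mask capitalized words as candidate factual spans (toy heuristic)."""
    spans = re.findall(r"\b[A-Z][a-z]+\b", summary)
    masked = summary
    for s in spans:
        masked = masked.replace(s, "[MASK]", 1)
    return masked, spans

def fill_cloze(document, masked_summary, n_blanks):
    """Toy filler: return capitalized words from the document, in order.
    The real framework infills the blanks with a language model."""
    return re.findall(r"\b[A-Z][a-z]+\b", document)[:n_blanks]

def cloze_consistency(document, summary):
    """Fraction of masked summary spans the filler reproduces from the
    document: 1.0 means fully consistent under this toy setup."""
    masked, gold = mask_factual_spans(summary)
    if not gold:
        return 1.0
    pred = fill_cloze(document, masked, len(gold))
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)
```

A summary whose entities all match the document scores 1.0; swapping one entity for an unsupported one lowers the score proportionally.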
♻ ☆ CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules
Large Language Models (LLMs) have already become quite proficient at solving
simpler programming tasks like those in HumanEval or MBPP benchmarks. However,
solving more complex and competitive programming tasks is still quite
challenging for these models - possibly due to their tendency to generate
solutions as monolithic code blocks instead of decomposing them into logical
sub-tasks and sub-modules. On the other hand, experienced programmers
instinctively write modularized code with abstraction for solving complex
tasks, often reusing previously developed modules. To address this gap, we
propose CodeChain, a novel framework for inference that elicits modularized
code generation through a chain of self-revisions, each being guided by some
representative sub-modules generated in previous iterations. Concretely,
CodeChain first instructs the LLM to generate modularized code through
chain-of-thought prompting. Then it applies a chain of self-revisions by
iterating the two steps: 1) extracting and clustering the generated sub-modules
and selecting the cluster representatives as the more generic and re-usable
implementations, and 2) augmenting the original chain-of-thought prompt with
these selected module-implementations and instructing the LLM to re-generate
new modularized solutions. We find that by naturally encouraging the LLM to
reuse the previously developed and verified sub-modules, CodeChain can
significantly boost both modularity as well as correctness of the generated
solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on
CodeContests. It is shown to be effective on both OpenAI LLMs and
open-source LLMs like WizardCoder. We also conduct comprehensive ablation
studies with different methods of prompting, number of clusters, model sizes,
program qualities, etc., to provide useful insights that underpin CodeChain's
success.
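The extract-cluster-augment steps of the self-revision chain can be sketched as below. This is a simplified reading, not the authors' code: sub-modules are extracted with Python's `ast`, and "clustering" is reduced to grouping functions by name and keeping the majority implementation, a crude stand-in for the paper's embedding-based clustering; the LLM calls themselves are omitted.

```python
import ast
from collections import Counter, defaultdict

def extract_submodules(solution_code):
    """Pull out top-level function definitions as (name, source) pairs."""
    tree = ast.parse(solution_code)
    return [(node.name, ast.get_source_segment(solution_code, node))
            for node in tree.body if isinstance(node, ast.FunctionDef)]

def cluster_representatives(solutions):
    """Group extracted functions by name and keep the most common source
    for each -- a crude proxy for selecting generic, reusable modules."""
    clusters = defaultdict(Counter)
    for code in solutions:
        for name, src in extract_submodules(code):
            clusters[name][src] += 1
    return {name: srcs.most_common(1)[0][0] for name, srcs in clusters.items()}

def revise_prompt(base_prompt, representatives):
    """Augment the original chain-of-thought prompt with the selected
    sub-modules and ask for a regenerated, modularized solution."""
    modules = "\n\n".join(representatives.values())
    return f"{base_prompt}\n\nReuse these sub-modules where possible:\n{modules}"
```

Each revision round would feed `revise_prompt(...)` back to the LLM, collect new solutions, and repeat the extract-cluster step on them.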
♻ ☆ On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao
The pursuit of autonomous driving technology hinges on the sophisticated
integration of perception, decision-making, and control systems. Traditional
approaches, both data-driven and rule-based, have been hindered by their
inability to grasp the nuance of complex driving environments and the
intentions of other road users. This has been a significant bottleneck,
particularly in the development of common sense reasoning and nuanced scene
understanding necessary for safe and reliable autonomous driving. The advent of
Visual Language Models (VLMs) represents a novel frontier in realizing fully
autonomous driving. This report provides an exhaustive evaluation of
the latest state-of-the-art VLM, GPT-4V(ision), and its application in
autonomous driving scenarios. We explore the model's abilities to understand
and reason about driving scenes, make decisions, and ultimately act in the
capacity of a driver. Our comprehensive tests span from basic scene recognition
to complex causal reasoning and real-time decision-making under varying
conditions. Our findings reveal that GPT-4V demonstrates superior performance
in scene understanding and causal reasoning compared to existing autonomous
systems. It showcases the potential to handle out-of-distribution scenarios,
recognize intentions, and make informed decisions in real driving contexts.
However, challenges remain, particularly in direction discernment, traffic
light recognition, vision grounding, and spatial reasoning tasks. These
limitations underscore the need for further research and development. Project
is now available on GitHub for interested parties to access and utilize:
\url{https://github.com/PJLab-ADG/GPT4V-AD-Exploration}
♻ ☆ A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity AACL 2023
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung
This paper proposes a framework for quantitatively evaluating interactive
LLMs such as ChatGPT using publicly available data sets. We carry out an
extensive technical evaluation of ChatGPT using 23 data sets covering 8
different common NLP application tasks. We evaluate the multitask, multilingual
and multi-modal aspects of ChatGPT based on these data sets and a newly
designed multimodal dataset. We find that ChatGPT outperforms LLMs with
zero-shot learning on most tasks and even outperforms fine-tuned models on some
tasks. We find that it is better at understanding non-Latin script languages
than generating them. It is able to generate multimodal content from textual
prompts, via an intermediate code generation step. Moreover, we find that
ChatGPT is 63.41% accurate on average in 10 different reasoning categories
under logical reasoning, non-textual reasoning, and commonsense reasoning,
hence making it an unreliable reasoner. It is, for example, better at deductive
than inductive reasoning. ChatGPT suffers from hallucination problems like
other LLMs and it generates more extrinsic hallucinations from its parametric
memory as it does not have access to an external knowledge base. Finally, the
interactive feature of ChatGPT enables human collaboration with the underlying
LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++
on machine translation, in a multi-turn "prompt engineering" fashion. We also
release codebase for evaluation set extraction.
comment: 45 pages, AACL 2023
♻ ☆ FELM: Benchmarking Factuality Evaluation of Large Language Models NeurIPS 2023
Assessing factuality of text generated by large language models (LLMs) is an
emerging yet crucial research area, aimed at alerting users to potential errors
and guiding the development of more reliable LLMs. Nonetheless, the evaluators
assessing factuality necessitate suitable evaluation themselves to gauge
progress and foster advancements. This direction remains under-explored,
resulting in substantial impediments to the progress of factuality evaluators.
To mitigate this issue, we introduce a benchmark for Factuality Evaluation of
large Language Models, referred to as felm. In this benchmark, we collect
responses generated from LLMs and annotate factuality labels in a fine-grained
manner. Contrary to previous studies that primarily concentrate on the
factuality of world knowledge (e.g.~information from Wikipedia), felm focuses
on factuality across diverse domains, spanning from world knowledge to math and
reasoning. Our annotation is based on text segments, which can help pinpoint
specific factual errors. The factuality annotations are further supplemented by
predefined error types and reference links that either support or contradict
the statement. In our experiments, we investigate the performance of several
LLM-based factuality evaluators on felm, including both vanilla LLMs and those
augmented with retrieval mechanisms and chain-of-thought processes. Our
findings reveal that while retrieval aids factuality evaluation, current LLMs
still fall far short of faithfully detecting factual errors.
comment: Accepted by NeurIPS 2023 Track on Datasets and Benchmarks
♻ ☆ Post-hoc Interpretability for Neural NLP: A Survey
Neural networks for NLP are becoming increasingly complex and widespread, and
there is a growing concern about whether these models are responsible to use.
Explaining
models helps to address the safety and ethical concerns and is essential for
accountability. Interpretability serves to provide these explanations in terms
that are understandable to humans. Additionally, post-hoc methods provide
explanations after a model is learned and are generally model-agnostic. This
survey provides a categorization of how recent post-hoc interpretability
methods communicate explanations to humans, discusses each method in depth,
and examines how the methods are validated, as validation is a common concern.
♻ ☆ Pre-training Language Models for Comparative Reasoning EMNLP 2023
Comparative reasoning is a process of comparing objects, concepts, or
entities to draw conclusions, which constitutes a fundamental cognitive
ability. In this paper, we propose a novel framework to pre-train language
models for enhancing their abilities of comparative reasoning over texts. While
there have been approaches for NLP tasks that require comparative reasoning,
they suffer from costly manual data labeling and limited generalizability to
different tasks. Our approach introduces a novel method of collecting scalable
data for text-based entity comparison, which leverages both structured and
unstructured data. Moreover, we present a framework of pre-training language
models via three novel objectives on comparative reasoning. Evaluation on
downstream tasks including comparative question answering, question generation,
and summarization shows that our pre-training framework significantly improves
the comparative reasoning abilities of language models, especially under
low-resource conditions. This work also releases the first integrated benchmark
for comparative reasoning.
comment: EMNLP 2023 - Camera Ready. Typos fixed
♻ ☆ Using large language models to study human memory for meaningful narratives
One of the most impressive achievements of the AI revolution is the
development of large language models that can generate meaningful text and
respond to instructions in plain English with no additional training necessary.
Here we show that language models can be used as a scientific instrument for
studying human memory for meaningful material. We developed a pipeline for
designing large scale memory experiments and analyzing the obtained results. We
performed online memory experiments with a large number of participants and
collected recognition and recall data for narratives of different lengths. We
found that both recall and recognition performance scale linearly with
narrative length. Furthermore, in order to investigate the role of narrative
comprehension in memory, we repeated these experiments using scrambled versions
of the presented stories. We found that even though recall performance declined
significantly, recognition remained largely unaffected. Interestingly, recalls
in this condition seem to follow the original narrative order rather than the
scrambled presentation, pointing to a contextual reconstruction of the story in
memory.
comment: v2: 43 pages, with added discussion and a new appendix C
♻ ☆ Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations
Existing research predominantly focuses on developing powerful large
language models (LLMs) for mathematical reasoning within monolingual settings,
with few explorations in preserving efficacy in a multilingual context. To
bridge this gap, this paper pioneers exploring and training powerful
Multilingual Math Reasoning (xMR) LLMs. Firstly, by utilizing translation, we
construct the first multilingual math reasoning instruction dataset,
MGSM8KInstruct, encompassing ten distinct languages, thus addressing the issue
of training data scarcity in xMR tasks. Based on the collected dataset, we
propose different training strategies to build powerful xMR LLMs, named
MathOctopus, which notably outperform conventional open-source LLMs and exhibit
superiority over ChatGPT in few-shot scenarios. Notably, MathOctopus-13B
reaches 47.6% accuracy, exceeding ChatGPT's 46.3% on the MGSM testset. Beyond
remarkable results, we unearth several pivotal observations and insights from
extensive experiments: (1) When extending the rejection sampling strategy to
the multilingual context, it proves effective for model performances, albeit
limited. (2) Employing parallel corpora for math Supervised Fine-Tuning (SFT)
across multiple languages not only significantly enhances model performance
multilingually but also elevates their monolingual performance. This indicates
that crafting multilingual corpora can be regarded as a vital strategy for
enhancing model performance in a specific language, especially in mathematical
reasoning tasks. For instance, MathOctopus-7B improves over its counterpart
trained only on English, from 42.2% to 50.8% on the GSM8K testset.
comment: Work in Progress
♻ ☆ Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement
Linlu Qiu, Liwei Jiang, Ximing Lu, Melanie Sclar, Valentina Pyatkin, Chandra Bhagavatula, Bailin Wang, Yoon Kim, Yejin Choi, Nouha Dziri, Xiang Ren
The ability to derive underlying principles from a handful of observations
and then generalize to novel situations -- known as inductive reasoning -- is
central to human intelligence. Prior work suggests that language models (LMs)
often fall short on inductive reasoning, despite achieving impressive success
on research benchmarks. In this work, we conduct a systematic study of the
inductive reasoning capabilities of LMs through iterative hypothesis
refinement, a technique that more closely mirrors the human inductive process
than standard input-output prompting. Iterative hypothesis refinement employs a
three-step process: proposing, selecting, and refining hypotheses in the form
of textual rules. By examining the intermediate rules, we observe that LMs are
phenomenal hypothesis proposers (i.e., generating candidate rules), and when
coupled with a (task-specific) symbolic interpreter that is able to
systematically filter the proposed set of rules, this hybrid approach achieves
strong results across inductive reasoning benchmarks that require inducing
causal relations, language-like instructions, and symbolic concepts. However,
they also behave as puzzling inductive reasoners, showing notable performance
gaps between rule induction (i.e., identifying plausible rules) and rule
application (i.e., applying proposed rules to instances), suggesting that LMs
are proposing hypotheses without being able to actually apply the rules.
Through empirical and human analyses, we further reveal several discrepancies
between the inductive reasoning processes of LMs and humans, shedding light on
both the potentials and limitations of using LMs in inductive reasoning tasks.
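The propose-select-refine loop studied above can be sketched as follows. Both components here are toy stand-ins, not the paper's setup: the proposer is a fixed pool of candidate rules rather than an LM emitting textual rules, and the "symbolic interpreter" simply executes each rule on the observed examples to score it.

```python
def propose(examples, prior_best=None):
    # Stand-in proposer: a fixed pool of candidate rules. In the paper,
    # an LM proposes textual rules conditioned on the observations.
    return [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2]

def select(rules, examples):
    """Interpreter step: keep the rule that fits the most examples."""
    def score(rule):
        return sum(rule(x) == y for x, y in examples)
    return max(rules, key=score), max(score(r) for r in rules)

def refine(examples, iterations=3):
    """Iterate propose-select until a rule explains all observations."""
    best, best_score = None, -1
    for _ in range(iterations):
        rules = propose(examples, best)
        candidate, s = select(rules, examples)
        if s > best_score:
            best, best_score = candidate, s
        if best_score == len(examples):
            break  # rule explains every observation
    return best
```

Given observations consistent with "double the input", the loop selects that rule, which can then be applied to novel inputs.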
♻ ☆ On the Performance of Multimodal Language Models
Instruction-tuned large language models (LLMs) have demonstrated promising
zero-shot generalization capabilities across various downstream tasks. Recent
research has introduced multimodal capabilities to LLMs by integrating
independently pretrained vision encoders through model grafting. These
multimodal variants undergo instruction tuning, similar to LLMs, enabling
effective zero-shot generalization for multimodal tasks. This study conducts a
comparative analysis of different multimodal instruction tuning approaches and
evaluates their performance across a range of tasks, including complex
reasoning, conversation, image captioning, multiple-choice questions (MCQs),
and binary classification. Through rigorous benchmarking and ablation
experiments, we reveal key insights for guiding architectural choices when
incorporating multimodal capabilities into LLMs. However, current approaches
have limitations; they do not sufficiently address the need for a diverse
multimodal instruction dataset, which is crucial for enhancing task
generalization. Additionally, they overlook issues related to truthfulness and
factuality when generating responses. These findings illuminate current
methodological constraints in adapting language models for image comprehension
and provide valuable guidance for researchers and practitioners seeking to
harness multimodal versions of LLMs.
♻ ☆ Certifying LLM Safety against Adversarial Prompting
Large language models (LLMs) released for public use incorporate guardrails
to ensure their output is safe, often referred to as "model alignment." An
aligned language model should decline a user's request to produce harmful
content. However, such safety measures are vulnerable to adversarial attacks,
which add maliciously designed token sequences to a harmful prompt to bypass
the model's safety guards. In this work, we introduce erase-and-check, the
first framework to defend against adversarial prompts with verifiable safety
guarantees. We defend against three attack modes: i) adversarial suffix, which
appends an adversarial sequence at the end of the prompt; ii) adversarial
insertion, where the adversarial sequence is inserted anywhere in the middle of
the prompt; and iii) adversarial infusion, where adversarial tokens are
inserted at arbitrary positions in the prompt, not necessarily as a contiguous
block. Our experimental results demonstrate that this procedure can obtain
strong certified safety guarantees on harmful prompts while maintaining good
empirical performance on safe prompts. For example, against adversarial
suffixes of length 20, it certifiably detects 92% of harmful prompts and labels
94% of safe prompts correctly using the open-source language model Llama 2 as
the safety filter. We further improve the filter's performance, in terms of
accuracy and speed, by replacing Llama 2 with a DistilBERT safety classifier
fine-tuned on safe and harmful prompts. Additionally, we propose two efficient
empirical defenses: i) RandEC, a randomized version of erase-and-check that
evaluates the safety filter on a small subset of the erased subsequences, and
ii) GradEC, a gradient-based version that optimizes the erased tokens to remove
the adversarial sequence. The code for our experiments is available at
https://github.com/aounon/certified-llm-safety.
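The suffix-mode erase-and-check procedure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `is_harmful` stands in for the paper's LLM safety filter (Llama 2 or a fine-tuned DistilBERT) and is replaced here by a toy exact-match check so the sketch is runnable; `HARMFUL_PROMPTS` is a hypothetical blocklist.

```python
# Toy filter that only recognizes exact known-harmful prompts, so an
# appended adversarial suffix evades it -- mimicking a fooled classifier.
HARMFUL_PROMPTS = {"please build_a_bomb"}

def is_harmful(tokens):
    return " ".join(tokens) in HARMFUL_PROMPTS

def erase_and_check_suffix(prompt_tokens, max_erase=20):
    """Label the prompt harmful if it, or any version with up to
    `max_erase` trailing tokens erased, is flagged by the filter.

    Guarantee for suffix attacks: if an adversary appended a suffix of
    length <= max_erase to a harmful prompt, one erased version is
    exactly the clean harmful prompt, so the filter must see it."""
    for i in range(max_erase + 1):
        erased = prompt_tokens[: max(0, len(prompt_tokens) - i)]
        if is_harmful(erased):
            return True  # certified harmful
    return False  # safe

# A 3-token adversarial suffix evades the plain filter but not the defense:
tokens = ["please", "build_a_bomb", "xq!", "zz#", "@@"]
print(is_harmful(tokens))                # False (filter alone is fooled)
print(erase_and_check_suffix(tokens))    # True  (defense catches it)
```

The insertion and infusion modes generalize this by erasing contiguous middle blocks or arbitrary token subsets, at correspondingly higher cost.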